2022-05-06

Missing data mechanisms

Types of missing data

In research, missing data occur when a data value is unavailable. Many empirical studies encounter missing data, which can arise at many stages of the research, from many different causes and in many different forms:

  • Non-response: an invited respondent does not participate in the study.
  • Intermittent missing data: missing data on one or more of the measured variables that are used as a predictor, covariate or outcome.
  • Drop-out or loss to follow-up: participants in a longitudinal study do not show up at one or more repeated measurement occasions.

Each type of missing data may have different reasons, and also different implications for the methods used to deal with the missing data.

Missing data mechanisms

The underlying causes of missing data are known as missing data mechanisms and were first described by Rubin (1976).

Rubin distinguished three missing data mechanisms:

  • missing completely at random (MCAR)
  • missing at random (MAR)
  • missing not at random (MNAR)

Missing completely at random (MCAR)

Missing data are MCAR when the probability of missing data on a variable is unrelated to any other measured variable and is unrelated to the variable with missing values itself.

In other words the missingness on the variable is completely unsystematic.


Data example

Below is the description of the complete data example. We will use this example to show the implications of each missing data mechanism.

Description of fully observed data
   vars    n  mean   sd median trimmed  mad   min  max range  skew kurtosis   se
X1    1 1000  0.00 2.21   0.07    0.02 2.25 -7.42 6.69 14.11 -0.13    -0.13 0.07
X2    2 1000  0.09 2.26   0.14    0.12 2.24 -6.73 6.58 13.31 -0.12    -0.15 0.07
X3    3 1000 -0.03 2.25   0.02   -0.04 2.27 -7.27 6.54 13.81  0.00    -0.22 0.07

MCAR data example

When we create MCAR data for 50% of the subjects in variable X1, we see that the statistics for variable X1 have not changed much:

Description of fully observed data
   vars    n mean   sd median trimmed  mad   min  max range  skew kurtosis   se
X1    1 1000    0 2.21   0.07    0.02 2.25 -7.42 6.69 14.11 -0.13    -0.13 0.07
Description of MCAR data - 50%
   vars   n mean  sd median trimmed  mad   min  max range  skew kurtosis   se
X1    1 490 0.01 2.1   0.02    0.04 2.14 -7.38 5.58 12.96 -0.18     -0.1 0.09
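How the example data were generated is not shown here. As a minimal sketch of how MCAR missingness could be simulated in base R, the block below assumes three normally distributed variables and gives every subject the same 50% chance of a missing X1. The sample size, coefficients, seed, and the names `full`, `mcar_sim`, and `is_missing` are illustrative assumptions.

```r
# Hypothetical simulation; all numbers below are assumptions for illustration.
set.seed(123)
n  <- 1000
X2 <- rnorm(n, mean = 0, sd = 2)
X3 <- rnorm(n, mean = 0, sd = 2)
X1 <- 0.5 * X2 + 0.4 * X3 + rnorm(n, sd = 1.5)  # X1 correlates with X2 and X3
full <- data.frame(X1, X2, X3)

# MCAR: the chance of a missing X1 is 0.5 for everyone,
# independent of any observed or unobserved value.
is_missing <- rbinom(n, size = 1, prob = 0.5) == 1
mcar_sim <- full
mcar_sim$X1[is_missing] <- NA

# Under MCAR the observed part of X1 should resemble the complete X1.
c(full_mean = mean(full$X1), observed_mean = mean(mcar_sim$X1, na.rm = TRUE))
c(full_sd   = sd(full$X1),   observed_sd   = sd(mcar_sim$X1, na.rm = TRUE))
```

Because the deletion ignores all values, the observed mean and standard deviation should differ from the complete-data values only by sampling error.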

MCAR distribution

Probability of MCAR

We can create a missing data indicator variable R1 to explore differences between the subjects with missing data and the subjects without missing data.

library(dplyr)  # provides %>% and mutate()
mcar <- mcar %>% mutate(R1 = is.na(X1))  # R1 is TRUE when X1 is missing

Missing at random (MAR)

Missing data are MAR when the probability of missing data on a variable is related to some other measured variable in the model, but not to the value of the variable with missing values itself.

For example, older people more often have missing values for IQ. In that case the probability of missing data on IQ is related to age.


MAR data example

When we create MAR data for 50% of the subjects in variable X1, we see that the statistics for variable X1 have changed:

Description of fully observed data
   vars    n mean   sd median trimmed  mad   min  max range  skew kurtosis   se
X1    1 1000    0 2.21   0.07    0.02 2.25 -7.42 6.69 14.11 -0.13    -0.13 0.07
Description of MAR data - 50%
   vars   n  mean   sd median trimmed mad   min  max range  skew kurtosis  se
X1    1 512 -0.28 2.25  -0.27   -0.26 2.3 -7.42 5.34 12.76 -0.13    -0.25 0.1
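As a rough sketch of how MAR missingness could be simulated, the block below lets the probability of a missing X1 depend only on the fully observed X2 through a logistic function. The coefficient 0.8, the seed, and the variable names are assumptions chosen for illustration, not the settings behind the table above.

```r
# Hypothetical simulation; all numbers below are assumptions for illustration.
set.seed(123)
n  <- 1000
X2 <- rnorm(n, sd = 2)
X1 <- 0.5 * X2 + rnorm(n, sd = 1.5)   # X1 correlates with X2

# MAR: the probability of a missing X1 depends on the observed X2 only.
# plogis() is the logistic function; the average probability is about 0.5.
p_miss <- plogis(0.8 * X2)
R1     <- rbinom(n, size = 1, prob = p_miss) == 1
X1_mar <- ifelse(R1, NA, X1)

# Subjects with high X2 (and therefore high X1) are missing more often,
# so the observed mean of X1 is pulled downwards.
c(full_mean = mean(X1), observed_mean = mean(X1_mar, na.rm = TRUE))
```

Given X2, the deletion does not depend on X1, yet the observed X1 values are no longer a random subsample because X1 and X2 are correlated.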

MAR distribution

Probability of MAR

We can create a missing data indicator variable R1 to explore differences between the subjects with missing data and the subjects without missing data.

mar <- mar %>% mutate(R1 = is.na(X1))

The difference between the group with missing values (TRUE) and the group without missing values (FALSE) shows that having missing data is related to the scores on the other variables.
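One way to produce such a group comparison is sketched below with base R's `aggregate()`. The data are simulated under assumed settings (the data frame `mar_sim`, its coefficients, and the seed are hypothetical), so the numbers will not match the example data.

```r
# Hypothetical MAR data; all numbers below are assumptions for illustration.
set.seed(42)
n  <- 1000
X2 <- rnorm(n, sd = 2)
X3 <- 0.5 * X2 + rnorm(n, sd = 1.7)
X1 <- 0.5 * X2 + rnorm(n, sd = 1.5)
X1[rbinom(n, 1, plogis(0.8 * X2)) == 1] <- NA   # missingness depends on X2

mar_sim    <- data.frame(X1, X2, X3)
mar_sim$R1 <- is.na(mar_sim$X1)                 # TRUE = X1 is missing

# Mean of the other variables per missing data group (FALSE / TRUE).
group_means <- aggregate(cbind(X2, X3) ~ R1, data = mar_sim, FUN = mean)
group_means
```

Clearly different means between the FALSE and TRUE rows indicate that missingness on X1 is related to X2 and X3.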

Missing not at random (MNAR)

Data are MNAR when the missing values on a variable are related to the values of that variable itself, even after controlling for other variables.

For example, weight data are MNAR when they are missing mostly for heavier persons.


MNAR data example

When we create MNAR data for 50% of the subjects in variable X1, we see that the statistics for variable X1 have changed:

Description of fully observed data
   vars    n mean   sd median trimmed  mad   min  max range  skew kurtosis   se
X1    1 1000    0 2.21   0.07    0.02 2.25 -7.42 6.69 14.11 -0.13    -0.13 0.07
Description of MNAR data - 50%
   vars   n  mean   sd median trimmed  mad   min  max range skew kurtosis   se
X1    1 488 -1.12 1.96  -1.17   -1.18 1.53 -7.42 5.11 12.53 0.26     0.73 0.09
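A minimal sketch of how MNAR missingness could be simulated: here the probability of a missing X1 depends on the value of X1 itself. The coefficient, seed, and variable names are assumptions for illustration and are not the settings used for the table above.

```r
# Hypothetical simulation; all numbers below are assumptions for illustration.
set.seed(123)
n  <- 1000
X2 <- rnorm(n, sd = 2)
X1 <- 0.5 * X2 + rnorm(n, sd = 1.5)

# MNAR: the probability of a missing X1 depends on X1 itself
# (high values are missing more often).
R1      <- rbinom(n, size = 1, prob = plogis(0.8 * X1)) == 1
X1_mnar <- ifelse(R1, NA, X1)

c(full_mean = mean(X1), observed_mean = mean(X1_mnar, na.rm = TRUE))

# Only possible in a simulation, where the true X1 is known:
# missingness still depends on X1 after controlling for X2.
fit_true <- glm(R1 ~ X1 + X2, family = binomial)
round(coef(summary(fit_true)), 3)
```

With real data the last step cannot be carried out, because the values of X1 are unknown exactly where they are missing.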

MNAR distribution

Probability of MNAR

We can create a missing data indicator variable R1 to explore differences between the subjects with missing data and the subjects without missing data.

mnar <- mnar %>% mutate(R1 = is.na(X1))

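The difference between the group with missing values (TRUE) and the group without missing values (FALSE) shows that having missing data is related to the scores on the other variables, which is expected when these variables are correlated with X1. The relation with the values of X1 itself cannot be observed.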

Evaluate the missing data mechanism

Reason for missing data

Any knowledge about the research process can help to evaluate, and make assumptions about, the missing data mechanism.

Why are data missing?


Missing data mechanisms

  • Missing completely at random: the observed data are a completely random subsample of the intended full sample.
  • Missing at random: probability of missing data is related to other measured variables.
  • Missing not at random: probability of missing data is related to the missing values themselves, and possibly to other measured variables.

Testing the mechanisms

The missing data mechanisms are defined by the probability that missing data occur.

Probability is not related to other measured variables

  • Assume the remaining sample is a totally random subsample (MCAR).

Other measured variables are related to the probability of missing data

  • Assume the data are not MCAR. However, we cannot definitively rule out MNAR, because in practice we never know the missing values themselves.

Statistical tests

The essence of testing for MCAR is to compare the group with missing data to the group without missing data.

Univariate testing

  • Independent samples t-test to compare the groups on continuous measures
  • Chi-square test to compare the groups on categorical measures

Multivariate testing

  • Logistic regression to evaluate several variables simultaneously
  • Little’s MCAR test

T-test to evaluate MCAR

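The independent samples T-test compares the mean of a continuous variable between the group with missing data and the group without missing data.

Note that the classic (Student's) T-test assumes normally distributed data and homogeneity of variance. R's t.test() applies the Welch correction by default, which does not require equal variances; this is the version shown in the output below.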

T-test

MCAR example

t.test(X2 ~ R1, data = mcar)
## 
##  Welch Two Sample t-test
## 
## data:  X2 by R1
## t = -0.11866, df = 996.42, p-value = 0.9056
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
##  -0.2970947  0.2632135
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##          0.07756972          0.09451031

T-test

MAR example

t.test(X2 ~ R1, data = mar)
## 
##  Welch Two Sample t-test
## 
## data:  X2 by R1
## t = -13.814, df = 994.06, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
##  -2.064496 -1.550917
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##          -0.7959512           1.0117550

T-test

MNAR example

t.test(X2 ~ R1, data = mnar)
## 
##  Welch Two Sample t-test
## 
## data:  X2 by R1
## t = -3.9996, df = 973.87, p-value = 6.824e-05
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
##  -0.8467042 -0.2893187
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##          -0.2046124           0.3633990

T-test

Univariate method.

When there are no significant differences we may assume the data are MCAR. Otherwise, we assume not-MCAR (i.e. MAR or MNAR).

Note that we can never truly rule out MNAR.

Chi-square test to evaluate MCAR

The Chi-square test compares the distribution over the categories of a categorical variable between the group with missing data and the group without missing data.

Note that the Chi-square test assumes that the expected cell frequencies are not too small.
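The expected cell frequencies can be inspected from the object that `chisq.test()` returns. The sketch below uses simulated data under assumed settings (the names `R1_sim`, `X3c_sim`, and `res` are hypothetical), and the threshold of 5 is a common rule of thumb rather than a strict requirement.

```r
# Hypothetical data; all numbers below are assumptions for illustration.
set.seed(1)
n       <- 1000
X3_sim  <- rnorm(n, sd = 2)
R1_sim  <- rbinom(n, size = 1, prob = 0.5) == 1   # MCAR-type indicator
X3c_sim <- ifelse(X3_sim > 0, 1, 0)               # dichotomised variable

tab <- table(R1_sim, X3c_sim)
res <- chisq.test(tab)

res$expected              # expected counts under independence
all(res$expected >= 5)    # rule-of-thumb check on the smallest cells
```

When some expected counts are small, Fisher's exact test (`fisher.test(tab)`) is a commonly used alternative for a 2 x 2 table.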

Chi-square

MCAR example

mcar <- mcar %>% mutate(X3c = ifelse(X3 > 0, 1, 0))
chisq.test(mcar$R1, mcar$X3c)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mcar$R1 and mcar$X3c
## X-squared = 0.062121, df = 1, p-value = 0.8032

Chi-square

MAR example

mar <- mar %>% mutate(X3c = ifelse(X3 > 0, 1, 0))
chisq.test(mar$R1, mar$X3c)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mar$R1 and mar$X3c
## X-squared = 80.787, df = 1, p-value < 2.2e-16

Chi-square

MNAR example

mnar <- mnar %>% mutate(X3c = ifelse(X3 > 0, 1, 0))
chisq.test(mnar$R1, mnar$X3c)

Logistic regression to evaluate MCAR

The probability of missing data can also be investigated in a logistic regression analysis.

The missing data indicator is the dependent variable and the other variables that may be related to the probability of missing data are the independent variables.

The results of the logistic regression analysis show if the independent variables relate to the probability of missing data.

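Note that when the other variables have missing values as well, a complete-case analysis is used by default.

Also note that glm() fits a linear (Gaussian) model unless family = binomial is specified. The calls below omit this argument, so the estimates shown are on the probability scale (a linear probability model) and the output reports t values instead of z values.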
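For reference, a logistic regression of the missing data indicator requires `family = binomial` in `glm()`; without it, `glm()` defaults to a Gaussian model. The sketch below uses simulated MAR data under assumed settings (the data frame `sim`, its coefficients, and the seed are hypothetical).

```r
# Hypothetical MAR data; all numbers below are assumptions for illustration.
set.seed(123)
n  <- 1000
X2 <- rnorm(n, sd = 2)
X3 <- rnorm(n, sd = 2)
X1 <- 0.5 * X2 + 0.4 * X3 + rnorm(n, sd = 1.5)
R1 <- rbinom(n, size = 1, prob = plogis(0.5 * X2)) == 1  # depends on X2 only
sim <- data.frame(X1 = ifelse(R1, NA, X1), X2, X3, R1)

# Logistic regression: missing data indicator as the dependent variable.
fit <- glm(R1 ~ X2 + X3, family = binomial, data = sim)
round(coef(summary(fit)), 3)   # estimates on the log-odds scale, z tests

exp(coef(fit))                 # odds ratios
```

In this simulation only X2 drives the missingness, so its coefficient should be clearly positive while the coefficient of X3 should be close to zero.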

Logistic regression

MCAR example

glm(R1 ~ X2 + X3, data = mcar) %>%
  summary %>% coefficients %>% round(.,3)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    0.510      0.016  32.171    0.000
## X2             0.002      0.007   0.277    0.782
## X3            -0.006      0.007  -0.870    0.384

Logistic regression

MAR example

glm(R1 ~ X2 + X3, data = mar) %>%
  summary %>% coefficients %>% round(.,3)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    0.483      0.014  34.573        0
## X2             0.078      0.006  12.460        0
## X3             0.056      0.006   8.909        0

Logistic regression

MNAR example

glm(R1 ~ X2 + X3, data = mnar) %>%
  summary %>% coefficients %>% round(.,3)

Logistic regression

In the MCAR example neither X2 nor X3 is related to the probability of missing data in X1, so we may assume that the missing data in X1 are MCAR.

However, in the MAR example, both variables are related to the probability of missing data in X1, so in that case we can assume that the data are not-MCAR.

We cannot rule out MNAR in this situation, since we cannot test the missing values themselves.

Little’s MCAR test

  • A multivariate test that evaluates the subgroups of the data that share the same missing data pattern.

  • Per subgroup (with same missing data pattern): observed means versus estimated means based on the expectation-maximization algorithm.

  • Chi-square distribution test to test the null hypothesis that data are MCAR.

  • A significant result shows that the data are not-MCAR.
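The steps above can be summarised in the test statistic as it is usually written for Little's test. The notation is introduced here for illustration: J is the number of missing data patterns, n_j the number of cases in pattern j, \bar{y}_j the observed means of the variables observed in pattern j, \hat{\mu}_j and \hat{\Sigma}_j the EM estimates of the means and covariance matrix restricted to those variables, p_j the number of variables observed in pattern j, and p the total number of variables.

```latex
d^2 = \sum_{j=1}^{J} n_j \,
      (\bar{y}_j - \hat{\mu}_j)^{\top} \,
      \hat{\Sigma}_j^{-1} \,
      (\bar{y}_j - \hat{\mu}_j),
\qquad
d^2 \sim \chi^2_{\,\sum_{j=1}^{J} p_j - p} \quad \text{under MCAR}
```

With three variables and two patterns (all variables observed, and X1 missing), the degrees of freedom are 3 + 2 - 3 = 2.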

Little’s MCAR test

MCAR example

misty::na.test(mcar %>% select(X1:X3))
##  Little's MCAR Test
## 
##      n nIncomp nPattern chi2 df  pval 
##   1000     510        2 0.77  2 0.679

Little’s MCAR test

MAR example

misty::na.test(mar %>% select(X1:X3))
##  Little's MCAR Test
## 
##      n nIncomp nPattern   chi2 df  pval 
##   1000     488        2 222.52  2 0.000

Little’s MCAR test

MNAR example

misty::na.test(mnar %>% select(X1:X3))

Little’s MCAR test notes

  • No specific information about which variables are related to the probability of missing data.

  • Test assumes multivariate normality and can only be applied to continuous variables.

  • The MNAR mechanism can never be ruled out, regardless of the result of the test.

Assuming the missing data mechanism (MCAR)

The methods to deal with missing data implicitly assume a missing data mechanism.

MCAR: the strictest assumption. In practice it is also easiest to deal with MCAR data.

  1. Analyze the observed sample only (this will result in unbiased estimates).
  2. Use an imputation method to boost the power when the amount of missing data is too large.

Assuming the missing data mechanism (MAR)

MAR: less strict assumption. Most advanced missing data methods assume this mechanism (e.g. multiple imputation, full information maximum likelihood (FIML)).

  • When variables that may explain the missing data are included in the study, a MAR assumption may become more plausible (as compared to MNAR).
  • These auxiliary variables may also help in dealing with the missing data.
  • Auxiliary variables: variables related to the probability of missing data or to the variable with missing data. Can be used as predictors in an imputation model or as covariates in a FIML model to improve estimations.
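To illustrate how an auxiliary variable can help under MAR, the sketch below applies a single stochastic regression imputation in base R. This is a simplification and not multiple imputation or FIML: it imputes once and ignores the uncertainty in the estimated regression parameters. The data, coefficients, seed, and names (`dat`, `imputed`) are assumptions for illustration.

```r
# Hypothetical MAR data; all numbers below are assumptions for illustration.
set.seed(2022)
n  <- 1000
X2 <- rnorm(n, sd = 2)                                   # auxiliary variable
X1 <- 0.6 * X2 + rnorm(n, sd = 1.5)
R1 <- rbinom(n, size = 1, prob = plogis(0.8 * X2)) == 1  # MAR: depends on X2
dat <- data.frame(X1 = ifelse(R1, NA, X1), X2)

# Imputation model estimated on the complete cases (lm() drops rows with NA).
fit  <- lm(X1 ~ X2, data = dat)
pred <- predict(fit, newdata = dat[R1, , drop = FALSE])

# Stochastic regression imputation: prediction plus random residual noise.
imputed     <- dat$X1
imputed[R1] <- pred + rnorm(sum(R1), mean = 0, sd = sigma(fit))

c(true          = mean(X1),
  complete_case = mean(dat$X1, na.rm = TRUE),
  imputed       = mean(imputed))
```

In this setup the complete-case mean is biased downwards, while the mean after imputation with the auxiliary variable should lie much closer to the true mean.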

Assuming the missing data mechanism (MNAR)

MNAR: least strict assumption about the mechanism, but the most difficult situation to handle.

  • MNAR data are also referred to as non-ignorable, because these cannot be ignored without causing bias in results. MNAR data are more challenging to deal with.